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We study the efficiency of an improved Multiboson algo- 
rithm with two flavours of Wilson fermions in a realistic phys- 
ical situation (/3 = 5.60, k = 0.156 on a 16^ x 24 lattice). The 
performance of this exact algorithm is compared with that of a 
state-of-the-art HMC algorithm: a considerable improvement 
is obtained for the plaquette auto-correlation time, while the 
two algorithms appear similarly efficient at decorrelating the 
topological charge. 

PACS: 11.15.H, 12.38.Gc, 02.70.Lq, 02.70.-c, 05.10.-a 



I. INTRODUCTION 

The Monte-Carlo simulation of lattice QCD, including 
the effects of dynamical quark loops, is a particularly dif- 
ficult and challenging problem, as it involves the compu- 
tation of the fermion matrix determinant, which appears, 
after integration over the anti-commuting fermion fields, 
in the QCD partition function 



det(^ 



Z = 



\{dU^^, 



-s,[u] 



Nf 

Y[det{l/)[U] 
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The fermion determinant leads to non-local interactions 
among the gauge links Ux,in so that the cost for updating 
all links naively grows at least as 0{V^), where V is the 
lattice volume. 

The standard approach to this problem is given by the 
Hybrid Monte Carlo (HMC) method. The computation 
of the determinant is achieved stochastically by the in- 
troduction of an auxiliary bosonic pseudo-fermion field 
(j). For the case of two degenerate flavours one may write 



|det(^ +m)|2 = f [d(t)^[d<l)] e-l(^+'^ 



^Vl^ 



(2) 



One still has to deal with a non-local action involving 
{p +m)^^, but now, using Molecular Dynamics, a global 
updating of U can be performed. 

An alternative approach, which allows the use of lo- 
cal algorithms, is the so-called Multiboson method, orig- 
inally proposed by Liischer [|l| . If a polynomial P„ (x) = 
CnY[k=ii-'^ ~ -^k) can be found which approximates l/x 
over the whole spectrum of {Ip +m), then Pn(3 + m) « 
{]p + m,)~^. Using a relation similar to Eq. (||) one can 
then write 



m)p « |det"^P„(_^ - 



to)|2 



(3) 



In this way the original QCD partition function is ap- 
proximated by a functional integration over the gauge 
links U and n bosonic fields (pk, where the integration 
measure is given in terms of a local action 



O — O/j 



Ei(^ 



Zk 



(4) 
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so that standard powerful local algorithms (heatbath, 
overrelaxation) can be used. The systematic error de- 
riving from this approximation can be corrected either 
during the simulation, with a global accept-reject step, 
or by a reweighting procedure on physical observables. 

An important question to the lattice scientific commu- 
nity is which of the two approaches (HMC or MB) is more 
efficient for simulating QCD with dynamical fermions. 
No definitive answer is yet available. On general grounds, 
the MB method has the advantage of being still relatively 
new, so that prospects for improvement are much greater. 

The MB approach allows two different strategies for 
improvement. The first is the choice of the approximating 
polynomial. As the number n of bosonic fields increases, 
the work per updating step grows as n. Moreover, the 
autocorrelation time for physical observables also grows 
approximately as n. It is therefore essential to choose a 
polynomial of the lowest possible degree while preserv- 
ing sufficient accuracy in the approximation, i.e. suffi- 
cient acceptance in the Metropolis step. The second is 
the choice of the update algorithm. Since we are dealing 
with a local action, a number of possible efficient local 
algorithms are available. Moreover a global updating of 
the bosonic field variables can be simply implemented, 
since their distribution is gaussian. The coupled dynam- 
ics of the system (gauge links -I- boson fields) is highly 
non trivial, so that the optimal mixture of update algo- 
rithms is essentially determined by numerical experiment 
0. As we will see, a good choice can make a substantial 
difference. 

The aim of the present work is to test the efficiency of 
an improved version of the MB algorithm in a physical 
situation and to directly compare its performance with a 
"state of the art" HMC algorithm, namely the one used 
by the SESAM collaboration S. Clearly, the efficiency 



of an algorithm is different depending on the physical ob- 
servable one is monitoring (i.e., one obtains different in- 
tegrated autocorrelation times for different observables) . 
In the present work we are studying the plaquette and 
the topological charge. These two observables are rep- 
resentative of the smallest- (UV) and largest- (IR) scale 
features of the gauge field. Therefore, their Monte Carlo 
dynamics provide a succinct description of the efficiency 
of a simulation at decorrelating all intermediate scales. 
The study of the topological charge dynamics is of par- 
ticular importance: because of the associated zero-mode 
crossing, it is expected theoretically, and has been shown 
in practice Q , that HMC algorithms near the chiral limit 
can be particularly inefficient at decorrelating topology, 
so that even using a very long simulation time on super- 
computers, one is not able to properly sample the topo- 
logical modes of the theory and ensure ergodicity. The 
efficiency of the MultiBoson method in this respect has 
not been studied yet. 

We have used the exact, non-hermitian version of the 
MB algorithm [gj with two flavours of Wilson fermions. 
In order to reduce the number of bosonic fields we have 
used the method proposed in |g]: after a precondition- 
ing of the Dirac matrix, which rewrites the effect of the 
UV fermionic degrees of freedom as a simple modifica- 
tion of the pure gauge action (UV-filtering) , an optimized 
polynomial is constructed numerically by adapting it, via 
quadratic minimization, to a typical configuration of the 
physical ensemble. 

The dynamics of the algorithm have then been im- 
proved by choosing a proper mixture of local overrelax- 
ation of the boson and gauge fields and of global heatbath 
on the boson fields. The introduction of the global heat- 
bath step (whose computational cost is strongly reduced 
by using a Quasi-Heatbath with approximate inversion 
|[7|) turns out to be essential in order to improve the dy- 
namics of the algorithm at small quark mass |^ . 

In Section II wc present more details about the algo- 
rithm used. In Section III the numerical results of our 
simulations are shown. In Section IV we give our conclu- 
sions. 



II. DESCRIPTION OF THE ALGORITHM 
A. Statics 

We have used the exact, non-hermitian version of the 
MB algorithm H, which includes a noisy Metropolis 
test to correct for the polynomial approximation to the 
fermionic determinant S. 

As the degree n of the approximating polynomial is 
increased, the work per independent configuration grows 
as n^ P] : it is therefore crucial to keep n as low as possi- 
ble while maintaining a good approximation, i.e. a good 
acceptance for the Metropolis test. 



In order to do that we have followed the procedure 
of [p[. Inspired by the loop expansion of the fermion 
determinant, 



det(l - nM) = e 



,f) - pTr Log(l-KM) 



one rewrites the determinant as 
det(l - kM) = 

e ^j X det (1 - kM) e ^i 



E,4TrM'^ (5) 

(6) 



where the identity det e^ = e"'"'' '^ has been used. 

The idea behind this is to filter out the UV part of the 

— y^ a-TrAP 

Dirac spectrum: the term e ^^^ ^ , which adds a set 

of small loops to the gauge action, accounts for the UV 
modes of the fermion determinant. In this way the zeros 
of the new polynomial (i.e. the one which approximates 

the inverse of (1 — KAf) e ^3°^ ) can be concentrated 
near the small eigenvalues (corresponding to IR modes) 
and a better approximation can be obtained with less 
cost. 

The number of coefficients {qj} as well as their values 
can be optimized. It is easy to see that for j < 4 the term 

can be reabsorbed in a shift of the coupling 
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(3 of the pure gauge Wilson action, A/3 = 192K*a4, so that 
no computational overhead is incurred. 

For the optimization of the coefficients {%} and of the 
zeros {zk} of the polynomial we have followed the proce- 
dure suggested in |g] , which we describe here briefly. The 
parameters have to be chosen so that det VF ~ 1, where 

W = nr(l - i^M - Zfel) • (1 - kM) ■ 6^7=0"^^^^', A 
sufficient condition for this is that Wr] ~ 77 for Gaussian 
vectors ?/. Therefore one takes an equilibrium gauge con- 
figuration, fixes the coefficients aj to some initial value, 
draws one or more Gaussian vectors rj and finds, by 
quadratic minimization, the roots {zk} which minimize 
the quantity e =|| Wr] — rj |p. 

During this procedure also de/daj can be computed, 
so that one can repeat this optimization using new values 
for aj , and thus minimize also with respect to the Gj 's by 
Newton's method. 

In principle one should perform an average over differ- 
ent equilibrium gauge fields. In practice results do not 
change appreciably with the gauge field, so that one sin- 
gle configuration gives sufficient information. 

An interesting byproduct of the optimization is that it 
is possible to estimate the acceptance for the Metropolis 
test expected during the actual Monte Carlo simulation. 
The acceptance probability for {Uoid} -^ {Unew} is 



mm(l, (- 



-lipoid r;|^ + h|" 



)v) 



(7) 



and a good estimate for this is erfc(c || WoidTj — rj |j), with 
c ~ 0(1). This estimate can be directly obtained as a 
result of the optimization procedure. 



B. Dynamics 
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As we will show in the next Section, the UV filtering is 
very effective in reducing the number of fields by a factor 
3 or more. The improvement may be really impressive for 
heavy quarks, since in this case the small loops account 
for most of the dynamical effects. 

However, as the quark mass decreases, one must in- 
clude larger loops in the gauge action or increase the 
number of boson fields. Because the multiplicity of larger 
Wilson loops increases combinatorically, the best com- 
promise (see next Section) limits the loop expansion to 
small loops and increases the number of fields, which 
eventually diverges as l/ruq |5|. In this case the dynamics 
of the system may be highly non trivial and the proper 
choice of the algorithm, or mixture of algorithms, may 
become very important. 

In particular, for light quark masses it is more likely 
for one of the zeros of the polynomial to get very close 
to the Dirac spectrum boundary, and so for the corre- 
sponding boson field to become almost massless. The 
dynamics of those IR boson fields then becomes critical. 
They represent the bottleneck of the dynamical evolution 
of the whole system. One has then to search for a good 
algorithm in order to speed up those slow modes. 

Due to the simple form of the bosonic distribution. 



P{cj,k) ^ cxp {-\{D - Zk) 



(8) 



a global heatbath on the bosonic fields can be simply 
implemented S and turns out to be very effective. The 
new field 0^^ is obtained by applying {D — Zk)^^ to a 
Gaussian vector rj 



^NEW 



iD-Zk)-^r]. 



(9) 



In this way (j)^^^ is completely uncorrelated from the 
old bosonic field. 

In practice, since an accurate exact inversion of {D — 
Zk) is needed, the global heatbath may become pro- 
hibitively expensive, especially for those fields where Zk 
is very close to the edge of the Dirac spectrum, so that 
{D — Zk) is almost singular. 

In order to cure this problem we have adopted, in- 
stead of the usual heatbath with exact inversion, a quasi- 
heatbath consisting of an approximate inversion plus a 
Metropolis accept-reject step as proposed in |7J. 

The idea is not to perform an exact inversion of {D — 
Zk), but to stop the inversion algorithm (BiCG in our 
case) early by loosening the convergence criterion. In 
order to preserve detailed balance, the system one solves 
approximately is 



{D - Zk)x = b, 



(10) 



with right-hand side b ^ {D — Zk)4'k^^ + V- The residual 
is r = b — {D — Zk)x. A candidate bosonic field is then 
formed as 



It is accepted with probability 

rmn (l, exp(2i?e(rt.(i? - z,)((/.r'^ - ^o^D)) (^3) 

It is easily proven that in this way the bosonic field is 
sampled with the correct distribution, for any conver- 
gence criterion of the solver. Using this simple trick it 
is possible to strongly reduce the number of iterations of 
the inversion algorithm (by a factor 3 or so) while main- 
taining an acceptance for cj)^^^ close to 1 [Q. The use 
of even-odd preconditioning further reduces the compu- 
tational demand. 

This global update of the bosonic fields has then to be 
combined with local update algorithms. We have chosen 
local overrelaxation for both gauge and bosonic fields[|. 

A good tuning of the relative frequencies of the three 
different algorithms in the mixture is essential. With 
the global heatbath some new uncorrelated information 
is brought into the statistical system via the boson fields, 
which of course needs some time to propagate to the 
gauge fields too. While a stochastic choice (with proper 
weights) among the three algorithms might seem prefer- 
able, we have observed that the use of a deterministic 
sequence of algorithms in the trajectory between two suc- 
cessive Metropolis steps is much more effective. 



III. NUMERICAL RESULTS 

Three representative systems, increasingly demanding 
in computer resources, have been studied: medium-heavy 
quarks in a small lattice, light quarks in a small lattice, 
and light quarks in a large lattice. In all three cases 
the efficiency of our algorithm at least matched that of 
HMC. Our simulations are of length O(10^)ri„t(n), suffi- 
cient to extract reasonably accurate (0(10%)) integrated 
autocorrelation times Ti„t{0) for the plaquette. 

In all cases we simulate 2 flavours of Wilson fermions, 
and use the non-hermitian version of the MultiBoson 
algorithm |5| with even-odd preconditioning and noisy 
Metropolis correction test. The gauge action is the Wil- 
son plaquette action. 

We have implemented UV-filtering up to fourth order 
(i.e. up to 4- link loops), which generates a shift A/3 in 
the gauge coupling /3, but causes no overhead. 

The details of the three simulations, denoted (A),(B) 
and (C), are summarized in Table I. Note that the opti- 
mized value of the coefficient 04, and thus A/3, is much 



^Note that no heatbath on gauge fields is needed, since er- 
godicity is already ensured by the global heatbath on bosonic 
fields 



larger than the hopping parameter expansion value (1/4). 
This is because UV-filtering does its best to approximate 
the infinite series given by the hopping parameter ex- 
pansion with a series truncated to 4th order: the best 
truncated series is not the truncation of the infinite one. 
When performing the Metropolis test between two tra- 
jectories the quantity Wr], where W = 11^(1 ~ ^-^^ 



Zkl) ■ (1 - nM) 



a2^j=a °J ^ has to be computed, and 
some care is needed in the evaluation of the exponential 

^ = e^i=o ' rj. Our method is to compute it by Tay- 
lor expansion, stopping the series when the contribution 
of the first neglected order to ^ is less than 10""'^'' for each 
site. 

For MB simulations (A) and (B), the order of update 
of the gauge links was not the usual one. All 8 links at- 
tached to a given site x, forming a "star" pattern, were 
updated before proceeding to another set of 8 links. This 
arrangement allows a re-use of intermediate link-products 
in the calculation of the gauge force, thereby reducing the 
overall amount of computation jl^l • More importantly, it 
becomes very simple in this scheme to integrate analyti- 
cally over the central bosonic fields (j)k {x) , which permits 
a larger-step update of the gauge fields. This provides 
similar advantages to the combined gauge-boson update 
of P], without the overhead. 

The HMC simulations used for comparisons incorpo- 
rate state-of-the-art improvements: even-odd precondi- 
tioning ((A), (B) and (C)); incomplete convergence of the 
solver during the MD trajectory ((A) and (B)) or time- 
extrapolation of the initial guess ((C)); BiCG75 ((A) and 
(B)) or BiCGStab ((C)) solver; multi-stepsize integra- 
tion, following |l^, in ((A) and (B)). Simulation (C) is 
a SESAM-coUaboration production run Q. It does not 
incorporate, however, SSOR preconditioning which they 
use with advantage in more recent projects where it re- 
duces the work per independent configuration by a factor 
of about two |12|. 



TABLE I. Summary of the parameters of our 3 simula- 
tions. The optimized coefficients 02 and 04 and A/3 have been 
rounded off to 3 decimals. Integrated autocorrelation times 
are measured in applications of the Wilson Dirac matrix to a 
vector. 



Simulation 


A 


B 


C 


Volume 


8^ 


8* 


le'' X 24 


/3,K 


5.3,0.158 


5.3,0.165 


5.6,0.156 


Tlhosons 


7 


16 


24 


a2 


4.411 


6.066 


4.077 


ttA 


1.389 


4.423 


8.789 


A/3 = 192K*a4 


0.166 


0.629 


0.999 


n„t{a){MB) 


~3500 


~ 64000 


~ 27500 


n„t{a){HMC) 


~ 14000 


~ 72000 


~ 85000+ 



f Obtained from the SESAM-coUaboration data on a 
163 X 32 lattice B. 



A. Medium-heavy quarks, small lattice 

The parameters of the simulation are indicated in Ta- 
ble I, column (A). Seven auxiliary fields only were re- 
quired after UV-filtering (note that there is no need for 
the number of fields to be even). Fig.l shows the lo- 
cation of the associated complex zeros of the approxi- 
mating polynomial, together with the boundary of the 
Dirac spectrum estimated by diagonalizing the tridiago- 
nal matrix given by the BiCG75 solver. Only one zero 
is devoted to controlling the UV part of the Dirac spec- 
trum, while the other 6 control the IR. This is because 
the shift A/? « 0.166 in the gauge couphng accounts for 
most of the UV fluctuations already. Because none of the 
auxiliary fields has a small mass, as can be judged from 
the distance of the zeros to the boundary of the Dirac 
spectrum Fig.l, local and global updates of these fields 
are about equally efficient. In both cases, the plaque- 
tte decorrelates about 4 times faster than with HMC, as 
shown in Fig. 2. This good result is inherent to the MB 
approach: for heavy quarks, it reduces to pure gauge lo- 
cal updates, whereas HMC remains an infinitesimal-step 
algorithm with much slower dynamics |13|] . Therefore, 
the relevant question is: at which quark mass, if any, 
does the MB approach lose its advantage? This is the 
reason for our second test. 



8* ^ = 5,3 /c = 0.15B 



FIG. 1. Zeros of the UV-filtered polynomial (circles), and 
estimated boundary of the Dirac spectrum (crosses) , simula- 
tion (A). 




FIG. 2. Autocorrelation of the plaquette, with the 
UV-filtered MB algorithm (solid Une) and the HMC algorithm 
(dashed line) at /? = 5.3 and k = 0.158 (simulation A). 



B. Light quarks, small lattice 

The parameters of the simulation are indicated in Ta- 
ble I, column (B). Hc is approximately 0.1686(3) |lj], so 
we are simulating light quarks. As appears clearly in 
Fig. 3, several of the IR zeros are now closer to the Dirac 
spectrum boundary. Consequently, the global Quasi- 
Heatbath provides a large improvement over local up- 
dates, by a factor 5 or more. The number of boson fields, 
n = 16, is very much smaller than the number of solver 
iterations per HMC step (~ 53). The 2 algorithms ap- 
pear about equally efficient (see Fig. 4). Since the work 
per independent configuration grows more slowly with 
the volume for MB than HMC {V{Log{V))^ vs V^/'^ ||), 
this bodes well for realistic simulations in large volumes. 



13 = 5:3 ^; = 0.165 



FIG. 3. Zeros of the UV- filtered polynomial (circles), and 
estimated boundary of the Dirac spectrum (crosses) , simula- 
tion (B). 
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FIG. 4. Autocorrelation of the plaquette, with the 
UV-filtered MB algorithm (solid line) and the HMC algorithm 
(dashed line) at /? = 5.3 and n — 0.165 (simulation B). 



C. Light quarks, large lattice 

The parameters of the simulation are indicated in Ta- 
ble I, column (C). This large-scale simulation has been 
performed on the APE/TOWER machine in Pisa. 



The number of boson fields, n = 24, is again very much 
smaller than the number of BiCG iterations per MD step 
{r-^ 91). Note that without UV-filtering, these 2 numbers 
become comparable: we needed n = 80 fields in this case, 
with the zeros of the polynomial evenly spaced on the 
unit circle centered at (1,0), to reach similar acceptance 
(~ 74%). 

Regarding the mixture of algorithms, we have found 
that a good compromise is to perform a global heatbath 
per bosonic field or each symmetrized trajectory com- 
posed of 12 couples of local over-relaxation sweeps for 
the bosonic and gauge fields. This choice may be subject 
to further optimization. 

Of the optimized roots of the UV-filtered polynomial, 
shown in Fig. 5, 16 are devoted to IR modes and 8 to UV 
modes of the Dirac operator. 
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FIG. 5. Zeros of the UV-filtered poly- 

nomial (n = 24) (circles), of the non- UV-filtered polynomial 
(n — 80)(+), and estimated boundary of the Dirac spectrum 
(crosses), simulation (C). 



To illustrate the benefits of UV-filtering and of the 
bosonic quasi-heatbath, we show in Fig. 6 the evolution of 
the plaquette, first without UV-filtering {n = 80 bosonic 
fields), then with UV-filtering {n — 24) but without 
quasi-heatbath, and finally with both features. The im- 
provement is clearly visible in each case. The autocor- 
relation of the plaquette is compared in Fig. 7 with that 
obtained by the SESAM collaboration using HMC [||. 
Our MB approach is more efficient by a factor ^ 3. 

In Fig. 8 we compare the topological charge histo- 
ries obtained from our simulation and from a sample of 
SESAM configurations ||l5[ . In both cases the same cool- 
ing method was used. Neither simulation is long enough 
to extract a reliable autocorrelation time for this ob- 
servable. Attempts at doing so yield roughly equivalent 
results for both algorithms, as the figure already indi- 
cates. Therefore, even for this global observable, our MB 
method seems not worse than HMC. 
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We considered extending the UV- filtering to order 6. 
This would allow a further reduction of the number of 
bosonic fields, at the expense of including in the action 
6-link loops coming from Tr{M^). The optimization of 
the UV-filtered polynomial was performed under these 
premises. It indicated that the same accuracy obtained 
with n — 24 and 4th-order UV-filtering could be obtained 
with n « 20 and 6th-order UV-filtering. This relatively 
small reduction in n did not justify the overhead of in- 
cluding 6-link loops in the update. 



IV. CONCLUSIONS 



FIG. 6. Monte Carlo history of the plaquette, with the 
non-hermitian MB algorithm (I), then with UV-filtering (II), 
finally with UV-filtering and bosonic quasi-heatbath (III) 
(simulation C). 
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FIG. 7. Autocorrelation of the plaquette, with the 
UV-filtered MB algorithm (solid line) and the SESAM HMC 
algorithm (dashed line) at /? = 5.6 and k = 0.156 (simulation 
C). 




0x10 



6x10' 



8x10' 



FIG. 8. Comparison of topological charge histories ob- 
tained with HMC and UV-fihered MB algorithm at /3 = 5.6 
and K = 0.156. The common scale has been set in terms of 
equivalent D x v multiplications. 



We have presented numerical evidence that the exact, 
non-hermitian MB algorithm is a superior alternative to 
the traditional HMC: it decorrelates the plaquette more 
efficiently, and the topological charge equivalently well. 
The key ingredients to achieve this efficiency are UV- 
filtering and global quasi-heatbath of the boson fields. 
The former absorbs the UV modes of the Dirac opera- 
tor in the gauge action, and thereby reduces the required 
number of bosonic fields by a factor 3 or more, thus re- 
moving the memory bottleneck of the non-filtered MB. 
The latter greatly accelerates, at low computer cost, the 
dynamics of the IR, low-mass boson fields. The same 
scheme can be implemented without changes to simula- 
tions with staggered fermions. Similar efficiency gains 
are expected. At the same time, the MB polynomial 
can be tailored to approximate any number of staggered 
flavours, for instance Nf — 2. With a correction step 
as used for Nj = 1 Wilson-quark simulations fl^, this 
will allow for exact, efficient simulation of two staggered 
flavours, which are inaccessible to HMC. 
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